{
 "cells": [
  {
   "cell_type": "markdown",
   "source": [
    "## TorchServe Continuous Batching Serve Llama-2-70B and Llama-3-70B on Inferentia-2\n",
    "This notebook demonstrates TorchServe continuous batching serving Llama-2-70b and Llama-3-70b on Inferentia-2 `inf2.48xlarge` with Neuron DLAMI Deep Learning AMI Neuron (Ubuntu 22.04) 20240401 and Neuron DLC [public.ecr.aws/neuron/pytorch-inference-neuronx:2.1.2-neuronx-py310-sdk2.18.2-ubuntu20.04](https://github.com/aws-neuron/deep-learning-containers?tab=readme-ov-file#pytorch-inference-neuronx)"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "### Installation"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Activate Transformers NeuronX (PyTorch 2.1) Python venv\n",
    "!source /opt/aws_neuronx_venv_transformers_neuronx/bin/activate\n",
    "\n",
    "# Install torch-model-archiver\n",
    "!pip install torch-model-archiver\n",
    "\n",
    "# Clone Torchserve git repository\n",
    "!git clone https://github.com/pytorch/serve.git\n",
    "\n",
    "# Install dependencies, now all commands run under serve dir\n",
    "!cd serve"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "### Create model artifacts\n",
    "Here we use llama3 model config file as an example"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "# login in Hugginface hub\n",
    "!pip install --upgrade huggingface_hub\n",
    "!huggingface-cli login --token $HUGGINGFACE_TOKEN\n",
    "!python examples/large_models/utils/Download_model.py --model_path model --model_name meta-llama/Llama-3-70b-hf --use_auth_token True\n",
    "\n",
    "# Create TorchServe model artifacts\n",
    "!torch-model-archiver --model-name llama-3-70b --version 1.0 --handler examples/large_models/inferentia2/llama/continuous_batching/base_neuronx_continuous_batching_handler.py -r examples/large_models/inferentia2/llama/requirements.txt --config-file examples/large_models/inferentia2/llama/continuous_batching/llama3-model-config.yaml --archive-format no-archive\n",
    "\n",
    "!mkdir -p model_store\n",
    "!mv llama-3-70b model_store\n",
    "!mv model model_store/llama-3-70b"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "### Start docker"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "!docker run --rm -it -v /home/ubuntu/serve/model_store/:/opt/ml/model -v /home/ubuntu/serve/:/home/model-server/serve --device /dev/neuron0:/dev/neuron0  --device /dev/neuron1:/dev/neuron1  --device /dev/neuron2:/dev/neuron2  --device /dev/neuron3:/dev/neuron3  --device /dev/neuron4:/dev/neuron4  --device /dev/neuron5:/dev/neuron5 --device /dev/neuron6:/dev/neuron6 --device /dev/neuron7:/dev/neuron7  --device /dev/neuron8:/dev/neuron8  --device /dev/neuron9:/dev/neuron9  --device /dev/neuron10:/dev/neuron10  --device /dev/neuron11:/dev/neuron11 -p 127.0.0.1:8080:8080 -p 127.0.0.1:8081:8081 -p 127.0.0.1:8082:8082 -p 127.0.0.1:7070:7070 -p 127.0.0.1:7071:7071 -e TS_INSTALL_PY_DEP_PER_MODEL=true public.ecr.aws/neuron/pytorch-inference-neuronx:2.1.2-neuronx-py310-sdk2.18.2-ubuntu20.04"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "### Run inference"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Run single inference request\n",
    "!python examples/large_models/utils/test_llm_streaming_response.py -m llama-3-70b -o 50 -t 2 -n 4 --prompt-text \"Today the weather is really nice and I am planning on \" --prompt-randomize"
   ],
   "metadata": {
    "collapsed": false
   }
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
